The acquisition of a speech corpus for limited domain translation

Authors

  • Demetrio Aiello
  • Loredana Cerrato
  • Cristina Delogu
  • Andrea Di Carlo
Abstract

In this paper we report on the first phase of the speech corpus collection for the purposes of the ESPRIT LTR project no. 30268, EuTrans. The corpus is intended to provide training material for speaker-independent continuous speech recognition and translation over the telephone line, based on a vocabulary of a few thousand words. Given this application, the corpus is structured so as to contain speech material for acoustic modelling and textual material for language modelling and translation modelling. The speech material that is being collected, and that we describe in this paper, has been produced in a natural way. The corpus is described with the aid of some statistical results that illustrate the characteristics of the acquired material. Finally, we present our plans for collecting the other parts of the corpus and, in particular, introduce a new "dialogue oriented" collection paradigm.

Similar resources

Language model acquisition from a text corpus for speech understanding

Speech understanding can be viewed as a problem of translating input natural language of speech recognition results into output semantic language. This paper describes automatic acquisition of a language model for translating natural language into semantic language from a text corpus using a stochastic method. The method estimates co-occurrence probabilities of input and output grammar rules as...

EuskoParl: a speech and text Spanish-Basque parallel corpus

The advances in corpus-based approaches and machine learning techniques have promoted the development of minority languages. The contribution of this work is to acquire a parallel corpus in Spanish and Basque with both text and speech data. In order to be able to compare the systems with those developed for other languages, Europarl corpus was taken as a reference in both domain and size. The a...

A'lam Corpus: A Standard Named Entity Corpus for the Persian Language

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question and answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...

Linguistic representation of Finnish in a limited domain speech-to-speech translation system

This paper describes the development of Finnish linguistic resources for use in MedSLT, an Open Source medical domain speech-to-speech translation system. The paper describes the collection of the medical sub-domain corpora for Finnish, the creation of the Finnish generation grammar by adapting the original English grammar, the composition of the domain specific Finnish lexicon and the definiti...

Statistical Approach to Chinese-English Spoken-Language Translation in Hotel Reservation Domain

This paper investigates a preliminary translation system from Chinese to English based on the statistical approach and tests its performance on a limited-domain spoken-language task: hotel reservation. A bilingual corpus is available for the task, which exhibits some typical phenomena of spontaneous speech. The experiments are performed on both the text transcription and the speech recognizer o...

Publication date: 1999